Abstract

Within the educational context, students' assessment tests are routinely validated
through Item Response Theory (IRT) models which assume unidimensionality and absence of
Differential Item Functioning (DIF). In this paper, we investigate whether such
assumptions hold for two national tests administered in Italy to middle school students
in June 2009: the Italian Test and the Mathematics Test. To this aim, we rely on an
extended class of multidimensional latent class IRT models characterised by: (i) a
two-parameter logistic parameterisation for the conditional probability of a correct
response, (ii) latent traits represented through a random vector with a discrete
distribution, and (iii) the inclusion of (uniform) DIF to account for students' gender
and geographical area. A classification of the items into unidimensional groups is also
proposed and represented by a dendrogram, which is obtained from a hierarchical
clustering algorithm. The results provide evidence for DIF effects for both Tests.
Besides, the assumption of unidimensionality is strongly rejected for the Italian Test,
whereas it is reasonable for the Mathematics Test.

1 Introduction



Italian National Tests for the assessment of primary, lower middle, and high-school
students are developed and collected yearly by the National Institute for the Evaluation
of the Education System (INVALSI). Before administration, national tests are validated
through pretesting sessions. These preliminary data are analysed by standard Classical
Test and Item Response Theory (IRT) models (Hambleton and Swaminathan, 1985).

In this paper, we focus on the Tests administered to middle school students, as they
are becoming increasingly relevant in the Italian education context and their collection
will become compulsory in the near future. In particular, we study whether the
assumptions of the IRT models used by the INVALSI to calibrate the national Tests are
met for the "live" data collected by this Institution in June 2009, focusing on the
assumptions of unidimensionality and of no Differential Item Functioning (DIF).
The data are based on a nationally representative sample of 27,592 students within 1,305
schools (one class is sampled in each school) and refer to students' performances in two
national tests, the Italian Test and the Mathematics Test.

In accordance with the assumption of unidimensionality, which characterizes the most 
common IRT models, responses to a set of items only depend on a single latent trait 
which, in the educational setting, can be interpreted as the student's ability. However, if 
unidimensionality is not met, summarizing students' performances through a single score, 
on the basis of a unidimensional IRT model, may be misleading as test items indeed 
measure more than one ability. Absence of DIF means that the items have the same 
difficulty for all subjects and, therefore, difficulty does not vary among different groups 
defined, for instance, by gender or geographical area. 

In connection with the Rasch model, the hypothesis of unidimensionality has been
extensively tested in the literature on the subject (Rasch, 1961; Glas and Verhelst, 2007;
Verhelst, 2001). One of the main contributions was developed by Martin-Löf (1973),
who proposed to test the hypothesis that the Rasch model holds for the whole set of items
against the hypothesis that this model holds for two disjoint subsets of items defined
in advance. Therefore, most statistical tests proposed in the literature are based on
the assumptions that: (i) item discrimination power is constant and (ii) the conditional
probability of answering a given item correctly does not vary across different groups. It is
plausible that, given the complexity of the INVALSI study, these assumptions are not met
for the INVALSI Test items, as they may not discriminate equally well among subjects
and may exhibit differential item functioning (DIF).

In line with the above issues, we illustrate an extension of the class of multidimensional
latent class IRT models developed by Bartolucci (2007) to include DIF effects.
Specifically, we consider the version of these models based on a two-parameter logistic (2PL)
parameterisation (Birnbaum, 1968) for the conditional probability of a correct response.
The applied models are of latent class type, as they rely on the assumption that the
population under study is made up of a finite number of classes, with subjects in the same
class having the same ability level (Lazarsfeld and Henry, 1968; Formann, 1995; Lindsay
et al., 1991). Representing the ability distribution through a discrete latent variable is
more flexible than representing it by means of a continuous distribution and is compatible
with the assumption of multidimensionality, which means that the adopted questionnaire
indeed measures more than one type of ability or dimension (Lazarsfeld and Henry, 1968;
Formann, 1995; Lindsay et al., 1991).

On the basis of the extended class of models described above, we analyse the 2009
INVALSI "live" data. These data are collected by two National Tests, which are developed
to assess a number of different abilities, such as the ability to make sense of written texts,
the ability to understand expressions and equations, and so on. As already mentioned,
these Tests are of particular relevance in the Italian educational system; moreover, their
reliability is currently much debated. With reference to these data, in particular, we
test the hypothesis of unidimensionality and that of absence of DIF. Moreover, we provide
a clustering of the items, so that the items in the same group refer to the same
ability. This is obtained by performing a sequence of Wald tests between nested
multidimensional IRT models belonging to the proposed class. The results of this clustering
procedure may be effectively illustrated by dendrograms.

The remainder of this paper is organized as follows. In the next section we describe the
INVALSI data used in our analysis. The statistical methodological approach employed to
investigate the structure of the questionnaires is described in Section 3. Firstly, we recall
the basics of the model adopted in our study (Bartolucci, 2007); then we show how it can
be extended to take into account DIF effects. Details about the estimation algorithm and
the use of these models to test unidimensionality and absence of DIF are given in Section
4. Finally, in Section 5, we illustrate the main results obtained by applying the proposed
approach to the INVALSI dataset and in Section 6 we draw the main conclusions of the
study.

2 The 2009 INVALSI Tests 

In 2009, the INVALSI Italian Test included two sections, a Reading Comprehension
section and a Grammar section. The first section is based on two texts: a narrative type
text (where readers engage with imagined events and actions) and an informational text
(where readers engage with real settings); see INVALSI (2009b). The comprehension
processes are measured by 30 items, which require students to demonstrate a range of
abilities and skills in constructing meaning from the two written texts. Two main types
of comprehension processes were considered in developing the items: Lexical Competency,
which covers the ability to make sense of words in the text and to recognize meaning
connections among them, and Textual Competency, which relates to the ability to: (i)
retrieve or locate information in the text, (ii) make inferences, connecting two or more
ideas or pieces of information and recognizing their relationship, and (iii) interpret and
integrate ideas and information, focusing on local or global meanings. The Grammar
section is made of 10 items, which measure the ability to understand the morphological
and syntactic structure of sentences within a text.

The INVALSI Mathematics Test consisted of 27 items covering four main content
domains: Numbers, Shapes and Figures, Algebra, and Data and Previsions (INVALSI,
2009c). The Numbers content domain consists of understanding (and operating with)
whole numbers, fractions and decimals, proportions, and percentage values. The Algebra
domain requires students to understand, among others, patterns, expressions
and first order equations, and to represent them through words, tables and graphs. Shapes
and Figures covers topics such as geometric shapes, measurement, location and movement.
It entails the ability to understand coordinate representations, to use spatial visualization
skills in order to move between two and three dimensional shapes, to draw symmetrical
figures, and to understand and describe rotations, translations, and reflections
in mathematical terms. The Data and Previsions domain includes three main topic areas:
data organization and representation (e.g., read, organize and display data using tables
and graphs), data interpretation (e.g., identify, calculate and compare characteristics
of datasets, including mean, median, mode), and chance (e.g., judge the chance of an
outcome, use data to estimate the chances of future outcomes).

All items included in the Italian Test are of multiple choice type, with one correct
answer and three distractors, and are dichotomously scored (assigning 1 point to correct
answers and 0 otherwise). The Mathematics Test is also made of multiple choice items,
but it also contains two open questions for which a partial score of 1 was assigned to
partially correct answers and a score of 2 was given to correct answers (see footnote 1).



1 For the purposes of the analyses described in the following sections, the open questions of the
Mathematics Test were dichotomously re-scored, giving 0 points to incorrect and partially correct
answers and 1 point otherwise.



The two Tests were administered in June 2009, at the end of the pupils' compulsory
educational period. Afterwards, a nationally representative sample made of 27,592 students
was drawn through stratified random sampling (INVALSI, 2009a). From each of the 21
strata (the 21 Italian geographic regions) a sample of schools was drawn independently
and allocation of sample units within each stratum was chosen proportional to an
indicator based on the standard deviations of certain variables and the stratum sizes (Neyman,
1934). Classes within schools were then sampled through a random procedure, with one
class sampled in each school, without taking into account the class size (only schools with
fewer than 10 students were excluded from the sampling procedure). Overall, 1305 schools
(and classes) were sampled. Table 1 and Table 2 show the distribution of students per
gender and geographic area, respectively for the Italian Test and the Mathematics Test
(see footnote 2).



                        Geographic area
Gender      NW      NE      Centre   South   Islands   Total
Females     1969    2203    2099     2194    2173      10638
Males       1922    2155    2242     2258    2182      10759
Total       3891    4358    4341     4452    4355      21397

Table 1: Distribution of students per gender and geographic area for the INVALSI Italian
Test.



                        Geographic area
Gender      NW      NE      Centre   South   Islands   Total
Females     1606    1940    1786     1884    1831      8825
Males       1538    1804    1840     1866    1777      9047
Total       3144    3744    3626     3750    3608      17872

Table 2: Distribution of students per gender and geographic area for the INVALSI
Mathematics Test.



2 Foreign students, students with disabilities and records with missing values were excluded from the
dataset.

Preliminary analyses (see Table 3 and Table 4) confirm that students' performances
on Test items differed according to students' gender and geographic area. Overall,
females performed better than males in the Italian Test, but worse than males in the
Mathematics Test. In both Tests, average percentage scores per geographic area revealed
very diverse levels of attainment. Generally, students from the Centre of Italy performed
better than the rest of the students in the Italian Test.



                        Geographic area
Gender      NW      NE      Centre   South   Islands   Overall
Females     75.0    73.9    76.2     75.2    73.6      74.8
Males       73.0    71.4    73.1     73.4    71.0      72.4
Overall     74.0    72.6    74.6     74.3    72.3      73.6

Table 3: Average percentage score per gender and geographic area for the INVALSI Italian
Test.



                        Geographic area
Gender      NW      NE      Centre   South   Islands   Overall
Females     73.3    71.9    75.6     77.5    76.3      75.0
Males       75.6    74.9    76.8     77.8    76.8      76.4
Overall     74.4    73.4    76.2     77.6    76.5      75.7

Table 4: Average percentage score per gender and geographic area for the INVALSI
Mathematics Test.



3 Methodological approach 

In this section, we illustrate the methodological approach adopted to investigate the
presence of DIF and the dimension of the latent structure behind the analysed data.
Firstly, we review the basic model proposed by Bartolucci (2007) and then we extend it
to include DIF effects.

3.1 Preliminaries

The multidimensional latent class (LC) IRT models developed by Bartolucci (2007) present
two main differences with respect to the classic IRT models: (i) the latent structure is
multidimensional and (ii) it is based on latent variables that have a discrete distribution.
We consider in particular the version of these models based on the two-parameter logistic
(2PL) parameterisation of the conditional response probabilities.

Let $n$ denote the number of subjects in the sample and suppose that these subjects
answer $r$ dichotomous test items which measure $s$ different latent traits or dimensions.
Also let $\mathcal{J}_d$, $d = 1, \ldots, s$, be the subset of $\mathcal{J} = \{1, \ldots, r\}$
containing the indices of the items measuring the latent trait of type $d$ and let $r_d$
denote the cardinality of this subset, so that $r = \sum_{d=1}^{s} r_d$. Since we assume
that each item measures only one latent trait, the subsets $\mathcal{J}_d$ are disjoint;
obviously, these latent traits may be correlated. Moreover, adopting a 2PL
parameterisation (Birnbaum, 1968), it is assumed that

$$\mathrm{logit}[p(Y_{ij} = 1 \mid \Theta_i = \theta)] = \gamma_j \Big( \sum_{d=1}^{s} \delta_{jd}\,\theta_d - \beta_j \Big), \quad i = 1, \ldots, n, \; j = 1, \ldots, r. \qquad (1)$$

In the above expression, $Y_{ij}$ is the random variable corresponding to the response to item
$j$ provided by subject $i$ ($Y_{ij} = 0, 1$ for wrong or right response, respectively). Moreover,
$\beta_j$ and $\gamma_j$ are, respectively, the difficulty and the discrimination of item $j$,
$\Theta_i = (\Theta_{i1}, \ldots, \Theta_{is})'$ is the vector of latent variables corresponding
to the different traits measured by the test items, $\theta = (\theta_1, \ldots, \theta_s)'$
denotes one of the possible realizations of $\Theta_i$, and $\delta_{jd}$ is a dummy
variable equal to 1 if item $j$ belongs to $\mathcal{J}_d$ (and then it measures the $d$th
latent trait) and to 0 otherwise. Finally, a crucial assumption is that each random vector
$\Theta_i$ has a discrete distribution with support $\{\xi_1, \ldots, \xi_k\}$, whose points
correspond to $k$ latent classes in the population. The elements of each vector $\xi_c$ are
denoted by $\xi_{cd}$, $d = 1, \ldots, s$, with $\xi_{cd}$ denoting the ability level of subjects
in latent class $c$ with respect to dimension $d$.

Note that, when $\gamma_j = 1$ for all $j$, the above 2PL parameterisation reduces to a
multidimensional Rasch parameterisation (Rasch, 1961). At the same time, when the
elements of each support vector $\xi_c$ are obtained by the same linear transformation of the
first element, the model is indeed unidimensional even when $s > 1$. The last consideration
will be useful in order to compute p-values for the test of unidimensionality.
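
As a purely illustrative sketch of parameterisation (1), and not the authors' Matlab implementation, the conditional probability of a correct response can be computed as follows. Because each item measures a single dimension, the sum over $d$ reduces to picking one element of $\theta$. The function name, the 0-based dimension index, and all numerical values are assumptions introduced only for this example.

```python
import math

def correct_response_prob(theta, item_dim, beta_j, gamma_j):
    """P(Y_ij = 1 | Theta_i = theta) under the 2PL parameterisation (1).

    theta    -- sequence of s ability levels, one per dimension
    item_dim -- 0-based index of the only dimension d with delta_jd = 1
    beta_j   -- difficulty of item j
    gamma_j  -- discrimination of item j
    """
    # sum_d delta_jd * theta_d selects the single dimension measured by item j
    logit = gamma_j * (theta[item_dim] - beta_j)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values, for illustration only (s = 2 dimensions).
theta = [0.5, -1.0]
p_2pl = correct_response_prob(theta, item_dim=0, beta_j=0.5, gamma_j=1.3)
# With gamma_j = 1 the expression reduces to the Rasch parameterisation.
p_rasch = correct_response_prob(theta, item_dim=1, beta_j=0.0, gamma_j=1.0)
```

In this toy setting the first item has difficulty equal to the subject's ability on its dimension, so the probability is one half regardless of the discrimination.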

As for the conventional LC model (Lazarsfeld and Henry, 1968; Goodman, 1974), the
assumption that the latent variables have a discrete distribution implies the following
manifest distribution of the full response vector $Y_i = (Y_{i1}, \ldots, Y_{ir})'$:

$$p_i(y) = p(Y_i = y) = \sum_{c=1}^{k} p_i(y \mid c)\,\pi_c, \qquad (2)$$

where $y = (y_1, \ldots, y_r)'$ denotes a realisation of $Y_i$, $\pi_c = p(\Theta_i = \xi_c)$ is the
weight of the $c$th latent class, and

$$p_i(y \mid c) = p(Y_i = y \mid \Theta_i = \xi_c) = \prod_{j=1}^{r} p(Y_{ij} = y_j \mid \Theta_i = \xi_c), \quad c = 1, \ldots, k. \qquad (3)$$

The specification of the multidimensional LC 2PL model, based on the assumptions
illustrated above, depends uniquely on: (i) the number of latent classes ($k$), (ii) the
number of dimensions ($s$), and (iii) the way items are associated to the different
dimensions. The last feature is related to the definition of the subsets $\mathcal{J}_d$,
$d = 1, \ldots, s$.
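
The mixture structure in (2) and (3) may be easier to follow in code. The sketch below is an assumption-laden illustration rather than the estimation code used in the paper: it evaluates the manifest probability of a response configuration as a weighted sum, over latent classes, of products of item-level probabilities. All names and parameter values are hypothetical.

```python
import math
from itertools import product

def item_prob(ability, beta_j, gamma_j):
    """2PL probability of a correct response for one item."""
    return 1.0 / (1.0 + math.exp(-gamma_j * (ability - beta_j)))

def manifest_prob(y, support, weights, item_dim, beta, gamma):
    """p(y) = sum_c pi_c * prod_j p(y_j | xi_c), as in (2) and (3)."""
    total = 0.0
    for xi_c, pi_c in zip(support, weights):
        cond = 1.0  # conditional probability of y given latent class c
        for j, y_j in enumerate(y):
            p = item_prob(xi_c[item_dim[j]], beta[j], gamma[j])
            cond *= p if y_j == 1 else 1.0 - p
        total += cond * pi_c
    return total

# Hypothetical toy setting: r = 3 items, s = 2 dimensions, k = 2 latent classes.
support = [(-1.0, -0.5), (1.0, 0.8)]  # support points xi_c
weights = [0.4, 0.6]                  # class weights pi_c
item_dim = [0, 0, 1]                  # dimension measured by each item
beta = [0.0, 0.3, 0.0]
gamma = [1.0, 1.2, 1.0]

# The manifest probabilities of all 2^r configurations must sum to one.
total = sum(manifest_prob(y, support, weights, item_dim, beta, gamma)
            for y in product((0, 1), repeat=3))
```

Summing over all response configurations is only feasible for small $r$; it is used here as a sanity check on the toy example.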

3.2 Extension for Differential Item Functioning 

DIF occurs when subjects belonging to different groups (commonly defined by gender, 
ethnicity, or geographic area) with the same latent trait level have a different probability 
of providing a certain answer to a given item (Thissen et al., 1993; Clauser and Mazor, 
1998; Swaminathan and Rogers, 1990). 

Even in the presence of a 2PL parameterisation, it is reasonable to suppose that DIF is
mainly due to the item difficulty level, which may depend on the individual
characteristics of the respondent. More precisely, the presence of DIF in the difficulty
level of item $j$ may be represented by shifted values of $\beta_j$ for one group of subjects
with respect to another.

Let $z_{gi}$ be a dummy variable which assumes value 1 if subject $i$ belongs to group $g$
(e.g., that of females) and value 0 otherwise. The number of groups is denoted by $h$, so
that, in the previous expression, $g = 1, \ldots, h$. When $s = 1$, the 2PL parameterisation
may be extended for DIF by assuming:

$$\mathrm{logit}[p(Y_{ij} = 1 \mid \Theta_i = \theta)] = \gamma_j \Big[ \theta - \Big( \beta_j + \sum_{g=1}^{h} \phi_{gj}\, z_{gi} \Big) \Big], \quad i = 1, \ldots, n, \; j = 1, \ldots, r, \qquad (4)$$

where $\phi_{gj}$ measures the shift for item $j$ in terms of difficulty. Therefore, if two subjects
have the same ability level $\theta$, but belong to two different groups, say $g_1$ and $g_2$, the
difference between the corresponding logits of a correct response is determined by
$\phi_{g_1 j} - \phi_{g_2 j}$; under (4) it equals $-\gamma_j(\phi_{g_1 j} - \phi_{g_2 j})$.
It can be observed that this difference between logits does
not depend on the common latent trait value $\theta$. In this case, the so-called uniform DIF
arises; see Thissen et al. (1993), Clauser and Mazor (1998), and Swaminathan and Rogers
(1990).

Obviously, DIF in the difficulty level may also be introduced in the multidimensional
case and when subjects are classified according to more criteria, giving an additive
structure to the corresponding DIF effects. More precisely, suppose that, as in our applications,
subjects are grouped according to two criteria and that the first criterion gives rise to $h_1$
groups, whereas the second gives rise to $h_2$ groups. Then, as an extension of (1), we have

$$\mathrm{logit}[p(Y_{ij} = 1 \mid \Theta_i = \theta)] = \gamma_j \Big[ \sum_{d=1}^{s} \delta_{jd}\,\theta_d - \Big( \beta_j + \sum_{g=1}^{h_1} \phi^{(1)}_{gj} z^{(1)}_{gi} + \sum_{g=1}^{h_2} \phi^{(2)}_{gj} z^{(2)}_{gi} \Big) \Big], \qquad (5)$$

for $i = 1, \ldots, n$ and $j = 1, \ldots, r$. In the above expression, each dummy variable
$z^{(1)}_{gi}$ is equal to 1 if subject $i$ belongs to group $g$ (when the classification of
subjects is based on the first criterion) and to 0 otherwise; $\phi^{(1)}_{gj}$ is the
corresponding DIF parameter. The dummy variables $z^{(2)}_{gi}$ and the parameters
$\phi^{(2)}_{gj}$ are defined accordingly. These parameters may be interpreted as clarified above.

In the approach proposed in this paper, we rely on assumption (5) to extend the class
of models of Bartolucci (2007) for DIF, even under the 2PL parameterisation and in the
presence of multidimensionality.
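
The defining property of uniform DIF, namely that the gap between the logits of two groups does not depend on the ability level, can be checked numerically. The sketch below assumes the shifted-difficulty form $\gamma_j[\theta - (\beta_j + \phi_{gj})]$ for a single grouping criterion and $s = 1$; the function name and all values are hypothetical.

```python
def dif_logit(theta, beta_j, gamma_j, phi_j, group):
    """Logit of a correct response with a group-specific difficulty shift.

    phi_j[group] is the DIF shift for the subject's group; the reference
    group (index 0) has phi_j[0] = 0 by the identifiability constraint.
    """
    return gamma_j * (theta - (beta_j + phi_j[group]))

# Hypothetical item: difficulty 0.2, discrimination 1.5, shift 0.4 for group 1.
phi_j = [0.0, 0.4]
gaps = [
    dif_logit(t, 0.2, 1.5, phi_j, 1) - dif_logit(t, 0.2, 1.5, phi_j, 0)
    for t in (-2.0, 0.0, 2.0)
]
# Every gap equals -gamma_j * (phi_1j - phi_0j) = -0.6, whatever theta is.
```

A positive shift therefore makes the item harder for the shifted group by the same amount on the logit scale at every ability level.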

4 Likelihood based inference 

In this section, we deal with maximum likelihood estimation of the extended model based on
assumption (5), with the problem of selecting the number of latent classes, and with testing
hypotheses on the parameters. The hypotheses of greatest interest in our context are those
of absence of DIF and unidimensionality. We also briefly outline the algorithm for clustering
items in unidimensional groups.

4.1 Maximum likelihood estimation 

Let $y_i$, $i = 1, \ldots, n$, denote the response configuration provided by subject $i$. For a
given $k$, the parameters of the proposed model may be estimated by maximizing the
log-likelihood

$$\ell(\eta) = \sum_{i} \log[p_i(y_i)], \qquad (6)$$

where $\eta$ is the vector containing all the free parameters, and $p_i(y)$ is the manifest
probability mass function of $y$ defined in (2) on the basis of the model parameters. When
subjects are classified according to only one criterion, an equivalent expression for the
log-likelihood is the following

$$\ell(\eta) = \sum_{g=1}^{h} \sum_{y} n(g, y) \log[p^*_g(y)], \qquad (7)$$

where $n(g, y)$ is the frequency, in the sample, of subjects who belong to group $g$ and
provide response configuration $y$, and $p^*_g(y)$ is the manifest probability of $y$ for the
subjects in group $g$. Moreover, the sum over $y$ is extended to all response configurations
observed at least once. Similar expressions result when subjects are classified according
to more criteria.

About the vector $\eta$, we clarify that it contains the item parameters $\beta_j$ (difficulty),
$\gamma_j$ (discriminating index), and $\phi_{gj}$ (DIF parameters), together with the
parameters $\xi_{cd}$ (ability levels) and $\pi_c$ (corresponding weights). However, to make
the model identifiable, we adopt the constraints

$$\beta_{j_d} = 0, \quad \gamma_{j_d} = 1, \quad d = 1, \ldots, s,$$

with $j_d$ denoting a reference item for the $d$-th dimension (usually, but not necessarily, the
first one in the group). When subjects are classified according to only one criterion, we
have

$$\phi_{1j} = 0, \quad j = 1, \ldots, r, \qquad (8)$$

where the first group is taken as reference group. In this way, for each item $j$, with
$j \in \mathcal{J}_d \setminus \{j_d\}$, the parameter $\beta_j$ is interpreted in terms of
differential difficulty level of this item with respect to item $j_d$; similarly, $\gamma_j$ is
interpreted in terms of ratio between the discriminant index of item $j$ and that of item
$j_d$. Finally, for $g > 1$, $\phi_{gj}$ corresponds to the differential difficulty level of
group $g$, with respect to the first group, for item $j$.
When subjects are classified according to, say, two criteria and assumption (5) is adopted,
then the identifiability constraints

$$\phi^{(1)}_{1j} = \phi^{(2)}_{1j} = 0, \quad j = 1, \ldots, r,$$

must be used instead of (8).

Considering the above identifiability constraints, when $k > 2$ and subjects are classified
according to a single criterion with regard to DIF, the number of free parameters collected
in $\eta$ is equal to

#par = $(k - 1) + ks + 2(r - s) + r(h - 1)$,

since there are $k - 1$ free latent class probabilities, $ks$ free ability parameters $\xi_{cd}$,
$r - s$ free difficulty parameters and as many free discriminant indices, and $r(h - 1)$ free
DIF parameters. For $k = 1, 2$, the proposed model does not pose any restriction over the LC
model and then we have #par = $(k - 1) + kr + r(h - 1)$. The number of parameters is simply
modified when subjects are classified according to more criteria. For instance, when the
classification is based on two criteria, and then assumption (5) holds, the term $r(h - 1)$
in #par needs to be substituted by $r(h_1 + h_2 - 2)$.

In order to maximise the log-likelihood $\ell(\eta)$, we make use of the
Expectation-Maximization (EM) algorithm (Dempster et al., 1977), which is implemented along
the same lines as in Bartolucci (2007). This algorithm is briefly described in Appendix 1;
a Matlab implementation is available from the authors upon request. The maximum likelihood
estimate of $\eta$, obtained from maximisation of $\ell(\eta)$, is denoted by $\hat{\eta}$.

After the parameter estimation, each subject $i$ can be allocated to one of the $k$ latent
classes on the basis of the response pattern $y_i$ he/she provided. The most common
approach is to assign the subject to the class with the highest posterior probability. On
the basis of the parameter estimates, the posterior probability is computed as

$$\hat{p}(c \mid y_i) = \hat{p}(\Theta_i = \xi_c \mid Y_i = y_i) = \frac{\hat{p}_i(y_i \mid c)\,\hat{\pi}_c}{\sum_{l=1}^{k} \hat{p}_i(y_i \mid l)\,\hat{\pi}_l}, \quad c = 1, \ldots, k. \qquad (9)$$
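
The allocation rule just described is an application of Bayes' theorem and can be sketched in a few lines. The snippet is illustrative only: it takes the estimated conditional probabilities of the observed pattern and the estimated class weights as given, and the numbers are invented for the example.

```python
def posterior_class_probs(cond_probs, weights):
    """Posterior probability of each latent class given the observed pattern.

    cond_probs[c] -- estimated p(y_i | class c)
    weights[c]    -- estimated class weight pi_c
    """
    joint = [p * w for p, w in zip(cond_probs, weights)]
    total = sum(joint)  # estimated manifest probability of y_i
    return [x / total for x in joint]

def allocate(cond_probs, weights):
    """0-based index of the class with the highest posterior probability."""
    post = posterior_class_probs(cond_probs, weights)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical values for k = 3 classes.
cond_probs = [0.02, 0.10, 0.05]
weights = [0.5, 0.3, 0.2]
post = posterior_class_probs(cond_probs, weights)
best = allocate(cond_probs, weights)
```

Here the second class is selected even though the first has the largest weight, because the observed pattern is much more likely under the second class.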

4.2 Choice of the number of latent classes, hypothesis testing, 
and dimensionality assessment 

In analysing a dataset by the model described in Section 3, a crucial point is the choice of
the number of latent classes $k$. To this aim, we rely on the Bayesian Information Criterion
(BIC) of Schwarz (1978). On the basis of this criterion, the selected number of classes is
the one corresponding to the minimum value of

BIC = $-2\,\ell(\hat{\eta}) + \log(n)\,$#par.

In practice, we fit the model for increasing values of $k$ until BIC starts to increase
and then we take the previous value of $k$ as the optimal one.
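
The parameter count and the BIC can be checked with a short computation. The sketch below encodes the two #par formulas given in Section 4.1, with the DIF term generalised to several grouping criteria as described there, together with the BIC expression above. Function names are hypothetical. The figures in the example are taken from this paper (30 Reading Comprehension items, groupings with 2 and 5 categories, 21,397 students in Table 1, and the log-likelihood reported in Table 5 for $k = 1$); because the reported log-likelihood is rounded, agreement with the tabulated BIC is only approximate.

```python
import math

def n_free_params(k, r, s, group_sizes):
    """Number of free parameters (#par) of the LC 2PL model with uniform DIF.

    k           -- number of latent classes
    r           -- number of items
    s           -- number of dimensions
    group_sizes -- number of categories of each grouping criterion, e.g. (2, 5)
    """
    dif = r * sum(h - 1 for h in group_sizes)  # r(h - 1), or r(h1 + h2 - 2)
    if k <= 2:
        # No restriction with respect to the standard LC model.
        return (k - 1) + k * r + dif
    return (k - 1) + k * s + 2 * (r - s) + dif

def bic(loglik, n_par, n):
    """BIC = -2 * loglik + log(n) * #par."""
    return -2.0 * loglik + math.log(n) * n_par

# Reading Comprehension, one dimension per item (s = r = 30).
par_k1 = n_free_params(1, 30, 30, (2, 5))
par_k3 = n_free_params(3, 30, 30, (2, 5))
bic_k1 = bic(-350474, par_k1, 21397)
```

Under these inputs the counts reproduce the #par column of Table 5 for $k = 1$ and $k = 3$, and the BIC agrees with the tabulated value up to rounding.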

Once the number of latent classes has been selected, it is of interest to test several
hypotheses on the parameters. To this aim, we can follow the general likelihood ratio
(LR) approach. For a hypothesis of type $H_0: f(\eta) = 0$, where $0$ denotes a column vector
of zeros of suitable dimension, this approach is based on the statistic

$$D = -2[\ell(\hat{\eta}_0) - \ell(\hat{\eta})], \qquad (10)$$

where $\hat{\eta}_0$ denotes the maximum likelihood estimate of $\eta$ under $H_0$. Under the
usual regularity conditions, this statistic has a null asymptotic distribution of $\chi^2_m$
type, where $m$ is the number of constraints imposed by $H_0$. An alternative approach is
the Wald test, which is based on the statistic

$$W = f(\hat{\eta})'\, G(\hat{\eta})\, f(\hat{\eta}), \qquad (11)$$

where $G(\hat{\eta})$ is a suitable matrix computed on the basis of the Jacobian of $f(\eta)$ and
the information matrix of the model. It is well known that the two approaches are
asymptotically equivalent. Differently from the LR approach, the one based on
the statistic $W$ requires fitting only the larger model, but it also requires computing the
information matrix of the model, which may be rather complex.

On the basis of the above approach, we can test the hypothesis of absence of DIF. In
this case, the null hypothesis is

$$H_0: \phi_{gj} = 0, \quad g = 2, \ldots, h, \; j = 1, \ldots, r,$$

or

$$H_0: \phi^{(1)}_{2j} = \cdots = \phi^{(1)}_{h_1 j} = \phi^{(2)}_{2j} = \cdots = \phi^{(2)}_{h_2 j} = 0, \quad j = 1, \ldots, r, \qquad (12)$$

when subjects are classified according to two criteria for what concerns DIF. Then, to test
$H_0$, we have to fit the model with and without DIF and compare the corresponding
log-likelihoods by (10). Thus, if the obtained value of the test statistic is higher than a
suitable percentile of the $\chi^2_m$ distribution, with $m = r(h - 1)$ under a single
criterion (and $m = r(h_1 + h_2 - 2)$ under two criteria, in line with the number of DIF
parameters given in Section 4.1), we reject $H_0$ and can state that there is evidence of DIF.

The above approach may also be used for the hypothesis that a group of items measures
only one latent trait, that is unidimensionality, against the hypothesis that the same group
is multidimensional. In the case of two dimensions, for instance, we have to compare by
the LR statistic (10) the model in which these dimensions are kept distinct with the model
in which these dimensions are collapsed. Under the null hypothesis of unidimensionality,
this test statistic has an asymptotic distribution of $\chi^2_m$ type, with $m = k - 2$. This is
because, as mentioned in Section 3.1, unidimensionality holds when the ability level for
the second dimension may be obtained by the same linear transformation of the ability
level for the first dimension, for every latent class $c$. Obviously, this test only makes sense
when $k > 2$ and, in general, may also be performed by a Wald statistic of type (11), once
the function $f(\eta)$ has been suitably defined; see Bartolucci (2007) for details.

By repeating the test for unidimensionality mentioned above in a suitable way, we can
cluster items so that items in the same group measure the same ability. On the basis
of this principle, Bartolucci (2007) proposed a hierarchical clustering algorithm that we
also apply to the extended models proposed here, which account for DIF. This algorithm
builds a sequence of nested models: the most general one is that with a different dimension
for each item (corresponding to the classic LC model in absence of DIF) and the most
restrictive model is that with only one dimension common to all items (unidimensional
model). The clustering procedure performs $s - 1$ steps. At each step, the Wald test statistic
for unidimensionality is computed for every pair of possible aggregations of items (or
groups of items). The aggregation with the minimum value of the statistic (or equivalently
the highest p-value) is then adopted and the corresponding model fitted before going to
the next step. A similar strategy could be based on the LR statistic, but in this case we
would be required to fit a much higher number of models. A Matlab implementation of
this algorithm is also available from the authors upon request.
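
The control flow of the clustering procedure can be sketched independently of the statistical machinery. In the snippet below the Wald statistic is replaced by a user-supplied callable, `merge_stat`, which is a hypothetical placeholder: computing the actual statistic requires fitting the model and its information matrix, which is not attempted here. The toy statistic used in the example has no statistical meaning and serves only to exercise the loop.

```python
from itertools import combinations

def cluster_items(items, merge_stat):
    """Greedy agglomerative clustering of items.

    Starts with one group per item and, at each step, merges the pair of
    groups with the smallest value of merge_stat(group_a, group_b, groups).
    merge_stat stands in for the Wald statistic for unidimensionality and
    must be supplied by the caller.
    Returns the merge history as a list of (group_a, group_b, statistic).
    """
    groups = [frozenset([j]) for j in items]
    history = []
    while len(groups) > 1:
        a, b = min(combinations(groups, 2),
                   key=lambda pair: merge_stat(pair[0], pair[1], groups))
        history.append((a, b, merge_stat(a, b, groups)))
        # Replace the two merged groups by their union before the next step.
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    return history

def toy_stat(a, b, groups):
    """Placeholder statistic: distance between the mean item labels."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

history = cluster_items([1, 2, 10, 11], toy_stat)
```

Starting from one group per item, the loop performs as many merges as there are items minus one, and the recorded statistics are what a dendrogram would display.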

The output of the above clustering algorithm may be displayed through a dendrogram
that shows the deviance between the initial ($k$-dimensional) LC model and the model
selected at each step of the clustering procedure. Obviously, the results of a cluster analysis
based on a hierarchical procedure depend on the adopted rule to cut the dendrogram,
which may be chosen according to several criteria. A rule that may be adopted to cut
the dendrogram is based on the increase of a suitable information criterion, such as BIC,
with respect to the initial or the previous fitted model. A negative increase of BIC means
that the new model reaches a better compromise between goodness-of-fit and parsimony
than the model used as a comparison term (i.e., the initial or the previous one). The
dendrogram is cut when the item aggregation does not give any additional advantages,
that is, in correspondence with the last step showing a negative increase.
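
One possible reading of this cutting rule is sketched below, under the assumption that each step is compared with the previous fitted model (the text also allows comparison with the initial model). The function name and the BIC values are hypothetical.

```python
def cut_step(bic_values):
    """Index of the last clustering step whose BIC is lower than the previous one.

    bic_values[0] refers to the initial model and bic_values[t] to the model
    selected at step t. Returns 0 if no aggregation lowers the BIC.
    """
    last = 0
    for t in range(1, len(bic_values)):
        if bic_values[t] - bic_values[t - 1] < 0:  # negative increase
            last = t
    return last

# Hypothetical BIC sequence for an initial model followed by five steps.
bic_path = [100.0, 95.0, 92.0, 94.0, 93.0, 99.0]
step = cut_step(bic_path)
```

A stricter variant would stop at the first step with a non-negative increase; which reading is intended is not specified in the text.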

5 Application to the INVALSI dataset 

In this section, we apply the extended class of models to the data collected by the two
INVALSI Tests. For the purposes of the analysis, the 30 items which assess reading
comprehension within the Italian Test are kept distinct from the 10 items which assess
grammar competency, as the two sections deal with two different competencies. Besides,
since we do not have any prior information on item discrimination power, we choose
the 2PL parameterisation and, regarding the way of taking DIF effects into account, we
consider subjects classified according to gender ($h_1 = 2$ categories: Males, Females) and
geographical area ($h_2 = 5$ categories: NorthWest, NorthEast, Centre, South, Islands).
Then, the adopted parameterisation is the same as in (5). The categories Males and
NorthWest are taken as reference categories.

In the following, we deal with the selection of the number of latent classes, with the 
problem of testing the hypothesis of absence of DIF, and with the issue of clustering items. 

5.1 Selection of the number of classes 

In order to choose the number of latent classes we proceed as described in Section 4.2 and
fit the model in the multidimensional version, in which each item is assumed to measure
a single ability, for values of $k$ from 1 to 9. The maximum value of $k$ is chosen to be equal
to 9 as it is the first value for which BIC is higher than that associated to the previous
value of $k$ for all Test sections. The results of this preliminary fitting are reported in
Table 5.



      Reading comprehension          Grammar                        Mathematics
k     log-lik.    #par   BIC         log-lik.    #par   BIC         log-lik.    #par   BIC
1     -350,474    180    702,743     -100,842     60    202,282     -242,111    162    485,808
2     -329,109    211    660,323      -95,580     71    192,899     -224,506    190    450,873
3     -326,171    242    654,760      -95,645     82    192,110     -221,976    218    446,090
4     -325,516    273    653,750      -95,580     93    192,090     -220,936    246    444,280
5     -324,970    304    652,970*     -95,517    104    192,070*    -220,032    274    442,750
6     -324,863    335    653,070      -95,470    115    192,090     -219,619    302    442,190
7     -324,764    366    653,178      -95,464    126    192,184     -219,248    330    441,730
8     -324,684    397    653,327      -95,454    137    192,274     -218,977    358    441,460*
9     -324,583    428    653,436      -95,429    148    192,334     -218,846    386    441,470

Table 5: Log-likelihood, number of parameters and BIC values for k = 1,...,9 latent
classes for the Reading Comprehension and the Grammar sections of the Italian Test and
for the Mathematics Test; the smallest BIC value for each type of Test is marked with an
asterisk.



On the basis of BIC, we choose k = 5 classes for both the Reading Comprehension 
and the Grammar sections of the Italian Test. As regards the Mathematics Test, although 
k = 8 is the optimal number of classes according to BIC, we choose k = 3, because for any 
number of classes greater than 3 the model becomes almost nonidentifiable, in the sense 
that the corresponding information matrix is close to being singular. We recall that this 
matrix is crucial for performing the Wald test for unidimensionality. 
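
The BIC-based choice can be checked directly from the values in Table 5. The following is a minimal sketch in Python; the dictionary simply transcribes the BIC columns of the table, and the helper name best_k is an illustrative assumption, not part of the original analysis.

```python
# BIC values transcribed from Table 5, listed for k = 1, ..., 9.
bic = {
    "Reading comprehension": [702743, 660323, 654760, 653750, 652970,
                              653070, 653178, 653327, 653436],
    "Grammar": [202282, 192899, 192110, 192090, 192070,
                192090, 192184, 192274, 192334],
    "Mathematics": [485808, 450873, 446090, 444280, 442750,
                    442190, 441730, 441460, 441470],
}


def best_k(values):
    """Return the number of classes k (1-based) with the smallest BIC."""
    return min(range(len(values)), key=values.__getitem__) + 1


selected = {section: best_k(values) for section, values in bic.items()}
print(selected)
```

The sketch only reproduces the BIC ranking; the final choice of k = 3 for the Mathematics Test rests on the identifiability argument given above, which BIC alone does not capture.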



5.2 Testing absence of DIF 

As previously specified, we group students on the basis of two covariates, gender and 
geographic area. The null hypothesis of no (uniform) DIF is formulated as in (12). In this 
regard, Table 6 shows the LR statistic, computed as in (10), which compares the 2PL 
multidimensional model with uniform DIF based on assumption (5) with the 2PL 
multidimensional model based on assumption (1). 





                  Deviance    p-value
Reading Compr.    1579.702    <0.001
Grammar           1313.427    <0.001
Mathematics       2183.573    <0.001



Table 6: Deviance of the multidimensional 2PL model with uniform DIF with respect to the 
multidimensional 2PL model with no DIF for the Italian Test - Reading Comprehension 
section and Grammar section - and the Mathematics Test. 
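
As a rough plausibility check of the p-values in Table 6, the deviances can be compared with a chi-square distribution. The sketch below is in Python and rests on two assumptions that are not stated in this section: (a) the degrees of freedom equal the number of items times five free DIF coefficients per item (one for gender, four for geographical area), and (b) the upper tail is approximated with the Wilson-Hilferty normal approximation rather than computed exactly. The function name chi2_sf_approx is hypothetical.

```python
import math


def chi2_sf_approx(x, df):
    """Approximate upper-tail chi-square probability (Wilson-Hilferty)."""
    mean = 1.0 - 2.0 / (9.0 * df)
    sd = math.sqrt(2.0 / (9.0 * df))
    z = ((x / df) ** (1.0 / 3.0) - mean) / sd
    # Upper tail of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Deviances from Table 6; degrees of freedom are an ASSUMPTION
# (number of items x 5 DIF coefficients per item).
tests = {
    "Reading Compr.": (1579.702, 30 * 5),
    "Grammar": (1313.427, 10 * 5),
    "Mathematics": (2183.573, 27 * 5),
}

p_values = {name: chi2_sf_approx(dev, df) for name, (dev, df) in tests.items()}
for name, p in p_values.items():
    print(name, "p < 0.001" if p < 0.001 else f"p = {p:.3f}")
```

Under these assumptions each deviance is many times larger than its degrees of freedom, which is consistent with the very small p-values reported in the table.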

According to these results, the assumption of no DIF is strongly rejected for both 
sections of the Italian Test and for the Mathematics Test. Therefore, in Table 7, Table 8, 
and Table 9 we provide the estimates of the DIF coefficients for gender and for 
geographical area. We recall that each of these coefficients represents the difference, in 
terms of difficulty of an item, between one group of subjects and the reference group, 
given the same ability level. 
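
To illustrate how a uniform DIF coefficient acts, the sketch below (in Python) evaluates a 2PL response probability with and without a shift in difficulty. It assumes that the DIF coefficient is simply added to the item difficulty, which is how the text describes it; the exact form of (5) may differ. The ability, discrimination, and difficulty values are hypothetical, while -0.447 is the estimate reported in Table 7 for Females on item R4.

```python
import math


def p_correct(theta, gamma, beta, phi=0.0):
    """2PL probability of a correct response with a uniform DIF shift phi
    added to the item difficulty beta (assumed parameterisation)."""
    return 1.0 / (1.0 + math.exp(-gamma * (theta - (beta + phi))))


# Hypothetical ability and item parameters, chosen only for illustration.
theta, gamma, beta = 0.0, 1.0, 0.0

p_reference = p_correct(theta, gamma, beta)          # reference group, no shift
p_focal = p_correct(theta, gamma, beta, phi=-0.447)  # Females, item R4 (Table 7)

print(round(p_reference, 3), round(p_focal, 3))
```

A negative coefficient lowers the difficulty for the group concerned, so at equal ability that group has a higher probability of a correct response than the reference group; a positive coefficient has the opposite effect.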

The results in these tables show that the Italian Test generally favours girls; 
conversely, the items of the Mathematics Test tend to favour boys. When taking into 
account students' geographic area, we observe that the incidence of items affected by DIF 
is, on the whole, stronger for the southern regions (Islands included) than for the central 
and northeastern regions, with a higher proportion of items significantly affected by DIF 
for the former geographic areas, both in the Italian Test and in the Mathematics Test. 
Specifically, for the two sections of the Italian Test, the analysis shows that students 
from the southern regions tend to have a lower chance of answering the items correctly 
than students from the other Italian regions. On the contrary, Mathematics Test items 
generally tend to favour students from the South of Italy. 

Item   Females     NorthEast   Centre      South       Islands
R1     -0.018      -0.051      -0.262***   -0.173**     0.032
R2     -0.322***    0.132      -0.005      -0.073       0.170
R3      0.021       0.211*     -0.057       0.212*      0.131
R4     -0.447***    0.253*      0.291**     0.428***    0.613***
R5     -0.377***    0.192*      0.094       0.110       0.227**
R6      0.117**     0.132       0.153*      0.305***    0.457***
R7     -0.662       0.083       0.040       0.196       0.433***
R8     -0.072       0.002       0.127*      0.229       0.159
R9     -0.170***    0.046       0.008       0.078       0.141**
R10    -0.340***    0.320*     -0.291*     -0.420**    -0.581***
R11    -0.159***    0.038       0.075       0.279***    0.415***
R12    -0.148***    0.069      -0.038       0.277***    0.227***
R13    -0.057       0.003       0.003      -0.035       0.111*
R14     0.096       0.019      -0.060       0.176*      0.159*
R15     0.001      -0.026      -0.079      -0.079       0.028
R16    -0.352***    0.270**    -0.387***   -0.682***   -0.661***
R17    -0.074**     0.058      -0.065      -0.017       0.067
R18     0.109**     0.036       0.075       0.232***    0.350**
R19     0.260**     0.044      -0.075       [...]***   -0.566***
R20     0.029       0.049       0.283***    0.207*      0.236**
R21    -0.195***   -0.068       0.018       0.327***    0.290**
R22    -0.193***    0.020      -0.022       0.216***    0.334***
R23    -0.254***    0.050       0.025       0.431***    0.441***
R24    -0.245*      0.223      -0.282*     -0.216      -0.067
R25    -0.068      -0.043       0.001      -0.173*     -0.056
R26    -0.319***    0.053      -0.106      -0.158**     0.160**
R27    -0.239***   -0.116      -0.079       0.150       0.313***
R28    -0.286***    0.094      -0.105      -0.215**    -0.014
R29    -0.179***   -0.071      -0.117       0.008       0.252**
R30    -0.405***    0.026      -0.119       0.182*      0.309***

Table 7: Estimated DIF coefficients for the Italian Test items - Reading Comprehension 
Section; significance at levels 0.001 (***), 0.01 (**), 0.05 (*). 



5.3 Dimensionality assessment 

Once the model which includes DIF has been adopted, with a specific k for each Test 
section (as defined in Section 5.1), we perform the item clustering algorithm described in 
Section 4.2. The output of this algorithm is represented by the dendrograms in Figures 1, 
2, and 3, which refer, respectively, to the Reading Comprehension section of the Italian 
Test, to the Grammar section of the same Test, and to the Mathematics Test. 



Item   Females     NorthEast   Centre      South       Islands
G1     -0.404       0.133       0.073       0.129       0.428
G2     -0.272***    0.156*      0.060       0.051       0.114
G3     -0.137***    0.059      -0.191**    -0.198**    -0.002
G4     -0.052       0.323***    0.004      -0.679***   -0.350***
G5     -0.328***    0.118*     -0.111*     -0.141**     0.126*
G6     -0.261***    0.362***   -0.043      -0.312***    0.072
G7     -0.323***    0.197***   -0.131*     -0.287***   -0.011
G8     -0.309***    0.060       0.144       0.099       0.524***
G9     -0.205*     -0.084      -0.067       0.290*      0.705***
G10    -0.167*      0.269*     -0.280*     -0.504***   -0.324*

Table 8: Estimated DIF coefficients for the Italian Test items - Grammar Section; 
significance at levels 0.001 (***), 0.01 (**), 0.05 (*). 

Figure 1: Dendrogram for the Italian Test - Reading Comprehension Section 



Following what is outlined in Section 3.3, we cut the dendrogram using a criterion 
based on BIC. In particular, since BIC tends to select more parsimonious models 
than other criteria (in particular with large sample sizes), and for consistency with the 
criterion applied to select the number of latent classes, we rely on the increase of BIC 
with respect to the initial model (i.e., the model with one dimension for each item). The 
values of the increase of BIC with respect to the initial model are shown in Table 10; 
note that the number of steps of the clustering algorithm depends on the number of items 
which are analysed. 

Item   Females     NorthEast   Centre      South       Islands
M1-M3  [rows illegible in the source; surviving values: 0.013, -0.027, 0.058, -0.051, 0.045]
M4      0.008       0.076      -0.021      -0.030       0.006
M5      0.024       0.102       0.109       0.057      -0.060
M6     -0.002       0.012      -0.102       0.072       0.076
M7      0.036      -0.097       0.073      -0.090      -0.139***
M8      0.022      -0.058       0.001       0.144       0.148
M9      0.071***   -0.022      -0.056*     -0.078**    -0.027
M10     0.102***    0.023      -0.041      -0.083**    -0.077**
M11     0.029*      0.031      -0.120***   -0.310***   -0.310***
M12     0.087       0.024      -0.057      -0.072       0.022
M13     0.055***    0.101***   -0.027      -0.073**    -0.024
M14     0.052**     0.065**    -0.031      -0.166***   -0.193***
M15     0.052**     0.037      -0.012       0.047      -0.002
M16     0.161***    0.030       0.019      -0.008      -0.019
M17     0.154      -0.001      -0.056      -0.060       [...]
M18     0.049***   -0.001      -0.023      -0.022      -0.025
M19    -0.008      -0.00[?]     0.032       0.103***    0.183***
M20    -0.024       0.029      -0.007      -0.055*     -0.006
M21     0.112***    0.104**     0.023      -0.034      -0.022
M22     0.033       0.078*     -0.036       0.001      -0.060*
M23     0.013       0.049*     -0.049*     -0.062**    -0.057**
M24    -0.012      -0.008      -0.097***   -0.143***   -0.125***
M25    -0.035**    -0.013       0.033       0.032       0.153***
M26     0.087***    0.089**    -0.058      -0.163***   -0.083*
M27     0.010       0.093**    -0.074      -0.269***   -0.312***

Table 9: Estimated DIF coefficients for the Mathematics Test items; significance at levels 
0.001 (***), 0.01 (**), 0.05 (*). 

Figure 2: Dendrogram for the Italian Test - Grammar Section 

Figure 3: Dendrogram for the Mathematics Test 

The results in Table 10 show that, with the adopted cut criterion and the chosen 
number of latent classes, the assumption of unidimensionality is not reasonable for either 
section of the Italian Test, and in particular for the Grammar section, whereas it is 
reasonable for the Mathematics Test. Indeed, there is evidence of s = 2 groups of items 
in the Reading Comprehension Section of the Italian Test, s = 5 groups of items in the 
Grammar Section of the Italian Test, and s = 1 group of items in the Mathematics Test. 
The 2 groups observed within the Reading Comprehension Section of the Italian Test are 
made of 24 and 6 items, corresponding to different, although correlated, dimensions which 
may be identified as the ability to: (i) make sense of words and sentences in the text 
and recognize meaning connections among them (24 items) and (ii) interpret, integrate 
and make inferences from a written text (6 items). As regards the Grammar Section 
of the Italian Test, the 5 groups of items correspond to the ability to: (i) recognize verb 
forms (1 item), (ii) recognize the meaning of connectives within a sentence (3 items), (iii) 
recognize grammatical categories (2 items), (iv) distinguish between clauses within 
a sentence (2 items), and (v) recognize the meaning of punctuation marks (2 items). 

From Table 11, which shows the support point estimates for the two sections of the 
Italian Test and the Mathematics Test, it can also be seen that, overall, membership 
of the higher latent classes is linked with increasing ability levels. 

Indeed, students belonging to class 5 within the two sections of the Italian Test, 
and to class 3 within the Mathematics Test, tend to have the highest ability level on 
the dimensions involved, whereas membership of the first latent class is generally 
associated with lower ability levels. These considerations hold for each dimension 
except the first dimension of the Reading Comprehension Section and the third dimension 
of the Grammar Section, where the support points are not ordered across the middle 
latent classes, and the fifth dimension of the Grammar Section, where the support point 
estimate for the first latent class is not the lowest. 
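
The ordering of the support points can be verified mechanically from Table 11. The following Python sketch transcribes the table and lists the dimensions whose support points do not increase strictly with the class index; the helper name is_increasing is an illustrative assumption.

```python
# Support point estimates transcribed from Table 11, ordered by latent class.
support = {
    ("Reading Comprehension", 1): [-1.193, 0.221, -0.329, 1.012, 2.776],
    ("Reading Comprehension", 2): [-1.404, -0.859, -0.049, 0.646, 1.378],
    ("Grammar", 1): [-0.334, 2.244, 2.536, 2.948, 4.363],
    ("Grammar", 2): [-0.853, -0.786, 0.812, 0.935, 2.807],
    ("Grammar", 3): [-0.827, -0.554, -2.068, 0.598, 2.384],
    ("Grammar", 4): [0.782, 1.224, 2.012, 2.507, 3.735],
    ("Grammar", 5): [-0.616, -1.069, -0.623, -0.056, 1.364],
    ("Mathematics", 1): [0.995, 1.509, 2.060],
}


def is_increasing(values):
    """True when each support point is larger than the previous one."""
    return all(a < b for a, b in zip(values, values[1:]))


non_monotone = sorted(key for key, points in support.items()
                      if not is_increasing(points))
print(non_monotone)
```

The dimensions returned by the check are exactly the three exceptions discussed above; in every dimension the last class still has the largest support point.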
Table 10: Diagnostics for the hierarchical clustering algorithm for the Italian Test - 
Reading Comprehension section and Grammar section - and the Mathematics Test: step of 
the procedure (h), number of groups (s), increase of BIC index with respect to the initial 
model; in boldface are the first positive values. 



                         Latent class (c)
                         1         2         3         4         5
Reading Comprehension
  Dimension 1           -1.193     0.221    -0.329     1.012     2.776
  Dimension 2           -1.404    -0.859    -0.049     0.646     1.378
Grammar
  Dimension 1           -0.334     2.244     2.536     2.948     4.363
  Dimension 2           -0.853    -0.786     0.812     0.935     2.807
  Dimension 3           -0.827    -0.554    -2.068     0.598     2.384
  Dimension 4            0.782     1.224     2.012     2.507     3.735
  Dimension 5           -0.616    -1.069    -0.623    -0.056     1.364
Mathematics
  Dimension 1            0.995     1.509     2.060

Table 11: Support points estimates for the Italian Test - Reading Comprehension section 
and Grammar section - and the Mathematics Test 



6 Conclusions 



The main objective of this paper is to evaluate the dimensionality of two national Tests 
employed to assess middle school Italian students' performance, testing the assumption 
of unidimensionality which characterizes most Item Response Theory models used to 
validate assessment data. We also test if the assumption of absence of Differential Item 
Functioning (DIF) is reasonable for these data. The data were collected in 2009 by the 
National Institute for the Evaluation of the Education System (INVALSI) and refer to two 
assessment Tests - on Italian language competencies (Reading comprehension, Grammar) 
and Mathematical competencies - administered to middle-school students. 

We base our analysis on a class of multidimensional latent class IRT models which 
allows us to test unidimensionality while concurrently taking into account the presence of 
DIF and the possibility that the items have non-constant discrimination power. This class 
of models is obtained as an extension for (uniform) DIF of the class of multidimensional 
two-parameter logistic (2PL) models developed by Bartolucci (2007). The inclusion of DIF 
effects has proven appropriate, as the hypothesis of absence of these effects was strongly 
rejected for both Tests considered here. Moreover, as is known, Tests containing items 
affected by DIF, which thus function differently for respondents who belong to different 
groups, may have reduced validity. In the context of this study, the soundness of 
between-group comparisons is weakened by the dependence of students' scores on 
attributes other than those the scale is intended to measure, that is, students' gender and 
geographical area. 

Concerning the hypothesis of unidimensionality, the advantage of the applied approach 
with respect to other approaches is that it can be employed when the items discriminate 
differently among subjects. Within the present analysis, relying on a 2PL parameterisa- 
tion has been justified by the lack of any prior information on discriminating power of the 
test items. 

To test the assumption of unidimensionality, we compare a unidimensional model with 
a multidimensional counterpart with the same 2PL parameterisation, the same number 
k of latent classes, and the same DIF structure, relying on a Wald test statistic. 
Subsequently, we cluster items in different unidimensional groups. The classification 
algorithm performed under this set-up showed that the assumption of unidimensionality is 
not supported by the data for the Italian Test, while it can be accepted for the Mathematics 
Test. Therefore, while summarizing students' performances on the Mathematics Test through 
a single score is appropriate, a single score cannot be sensibly used to describe students' 
attainment on the Italian Test (especially on the Grammar section), as the differences 
among students do not depend on a single ability level. 

Appendix 1: EM algorithm for model estimation 

The complete log-likelihood, on which the EM algorithm is based, may be expressed as 

$$\ell^*(\eta) = \sum_{c=1}^{k} \sum_{g=1}^{h} \sum_{y} m(c,g,y) \log[p_g(y|c)\,\pi_c], \qquad (13)$$

which is directly related to the incomplete log-likelihood defined in (7), and where 
$m(c,g,y)$ denotes the number of subjects providing response configuration $y$ and 
belonging to latent class $c$ and to group $g$, whereas $p_g(y|c)$ corresponds to the 
conditional probability defined in (3) for a subject belonging to the $g$-th group. 

Usually, $\ell^*(\eta)$ is much easier to maximize than $\ell(\eta)$. However, since the 
frequencies $m(c,g,y)$ are not known, the EM algorithm alternates the following two steps 
until convergence in $\ell(\eta)$: 

• E-step. It consists of computing the expected value of the complete log-likelihood 
$\ell^*(\eta)$; this is equivalent to substituting each frequency $m(c,g,y)$ with its 
expected value 

$$\hat{m}(c,g,y) = n(g,y)\,\frac{p_g(y|c)\,\pi_c}{\sum_{h} p_g(y|h)\,\pi_h}$$

under the current value of the parameters. 

• M-step. It consists of updating the model parameters by maximizing the expected 
value of $\ell^*(\eta)$. More precisely, for the weights $\pi_c$ an explicit solution 
exists, which is given by 

$$\pi_c = \frac{\sum_{g} \sum_{y} \hat{m}(c,g,y)}{n}, \qquad c = 1, \ldots, k.$$

For the other parameters, since an explicit solution does not exist, an iterative 
optimization algorithm of Newton-Raphson type may be used. The resulting estimates 
of $\eta$ are used to update $\hat{m}(c,g,y)$ at the next E-step. 
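
The two steps above can be sketched in a few lines of Python for the part that has a closed form, namely the expected frequencies and the class weights. This is only an illustration under stated assumptions: the data are made up (two groups, one binary item, k = 2 classes), the conditional probabilities $p_g(y|c)$ are treated as given rather than derived from the 2PL model with DIF, and the function names e_step and m_step_weights are hypothetical. The Newton-Raphson update of the remaining parameters is not shown.

```python
def e_step(n_obs, p_cond, weights):
    """Expected complete frequencies m_hat[g][y][c] under current parameters."""
    m_hat = {}
    for g, counts in n_obs.items():
        m_hat[g] = {}
        for y, n_gy in counts.items():
            # Joint terms p_g(y|c) * pi_c for every latent class c.
            joint = [p * w for p, w in zip(p_cond[g][y], weights)]
            total = sum(joint)
            m_hat[g][y] = [n_gy * j / total for j in joint]
    return m_hat


def m_step_weights(m_hat, n_total, k):
    """Closed-form update: pi_c = sum_g sum_y m_hat(c, g, y) / n."""
    sums = [0.0] * k
    for per_group in m_hat.values():
        for freqs in per_group.values():
            for c, f in enumerate(freqs):
                sums[c] += f
    return [s / n_total for s in sums]


# Toy, made-up data: two groups, one binary item, k = 2 latent classes.
n_obs = {0: {(0,): 30, (1,): 20}, 1: {(0,): 10, (1,): 40}}
p_cond = {
    0: {(0,): [0.7, 0.2], (1,): [0.3, 0.8]},
    1: {(0,): [0.6, 0.1], (1,): [0.4, 0.9]},
}
weights = [0.5, 0.5]
n_total = sum(sum(counts.values()) for counts in n_obs.values())

m_hat = e_step(n_obs, p_cond, weights)
new_weights = m_step_weights(m_hat, n_total, k=len(weights))
print(new_weights)
```

By construction, the expected frequencies for each observed configuration add up to the observed frequency $n(g,y)$, so the updated weights always sum to one.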

When the algorithm converges, the last value of $\eta$, denoted by $\hat{\eta}$, 
corresponds to the maximum of $\ell(\eta)$ and is then taken as the maximum likelihood 
estimate of this parameter vector. It is important to highlight that the running time and, 
in particular, the detection of a global rather than a local maximum point crucially depend 
on the initialization of the EM algorithm. Therefore, following Bartolucci (2007), we 
recommend trying several initializations of this algorithm, which may be formulated in 
terms of initial expected frequencies $\hat{m}(c,g,y)$. These frequencies may be obtained 
by multiplying each observed frequency $n(g,y)$ by a given constant $a_c(y)$ depending on 
the total score (i.e., the sum of the elements in $y$). These constants must satisfy the 
obvious constraints $a_c(y) > 0$, $c = 1, \ldots, k$, and $\sum_c a_c(y) = 1$ for all $y$. 
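
One possible way to build such constants is sketched below in Python. The specific scheme is an assumption made for illustration only, since the text states the constraints but not the form of $a_c(y)$: each class is given an evenly spaced target for the proportion of correct answers, and a class receives more weight the closer the observed proportion is to its target. The function names and the spread parameter are hypothetical; varying spread gives different starting points.

```python
import math


def init_constants(total_score, n_items, k, spread=0.25):
    """Hypothetical a_c(y): positive constants summing to one over classes,
    depending on y only through the total score."""
    proportion = total_score / n_items
    targets = [(c + 0.5) / k for c in range(k)]  # evenly spaced in (0, 1)
    raw = [math.exp(-((proportion - t) ** 2) / (2.0 * spread ** 2))
           for t in targets]
    norm = sum(raw)
    return [r / norm for r in raw]


def init_expected_freq(n_gy, y, k, spread=0.25):
    """Initial expected frequencies: n(g, y) multiplied by a_c(y)."""
    a = init_constants(sum(y), len(y), k, spread)
    return [n_gy * a_c for a_c in a]


a = init_constants(total_score=3, n_items=10, k=5)
m0 = init_expected_freq(n_gy=12, y=(1, 0, 1, 1), k=3)
print(a)
print(m0)
```

Any other choice that respects the two constraints would serve equally well as a starting point; running the EM algorithm from several of them and keeping the solution with the highest log-likelihood is the usual safeguard against local maxima.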